Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate Citation detector with Metrics #126

Merged
merged 1 commit into from
Oct 18, 2024
Merged

Conversation

matt-bernhardt
Copy link
Member

@matt-bernhardt matt-bernhardt commented Oct 17, 2024

This integrates the citation detector with the monthly metrics setup. It includes a tweak to one of the term fixtures, because the ISBN fixture had ended up getting counted in the citation detector (which is a real phenomenon, but our test fixtures feel like they should be crafted to keep things separate).

Quick note about the side effect

The fact that this needed to tweak a different term fixture is an indication that our detectors are starting to overlap. This is a real phenomenon in production, and is expected as we start to build more detectors.

I do not have a notion for now about how to address this in the more constrained test environment, where we can carefully craft fixtures to do exactly what we intend. It feels like this behavior should be confirmed via explicit tests that address this overlap, but the tests we have so far are focused on one-and-only-one detector, so I feel okay about tweaking things to keep them separate.

We do have a term fixture that trips multiple detectors, confirming that they get counted in all applicable categories, so we may be covered already?

Developer

Ticket(s)

https://mitlibraries.atlassian.net/browse/TCO-107

Accessibility

  • ANDI or Wave has been run in accordance to our guide and
    all issues introduced by these changes have been resolved or opened
    as new issues (link to those issues in the Pull Request details above)
  • There are no accessibility implications to this change

Documentation

  • Project documentation has been updated, and yard output previewed
  • No documentation changes are needed

ENV

  • All new ENV is documented in README.
  • All new ENV has been added to Heroku Pipeline, Staging and Prod.
  • ENV has not changed.

Stakeholders

  • Stakeholder approval has been confirmed
  • Stakeholder approval is not needed

Dependencies and migrations

NO dependencies are updated

YES migrations are included

Reviewer

Code

  • I have confirmed that the code works as intended.
  • Any CodeClimate issues have been fixed or confirmed as
    added technical debt.

Documentation

  • The commit message is clear and follows our guidelines
    (not just this pull request message).
  • The documentation has been updated or is unnecessary.
  • New dependencies are appropriate or there were no changes.

Testing

  • There are appropriate tests covering any new functionality.
  • No additional test coverage is required.

** Why are these changes being introduced:

Now that the Detection::Citation model is in place, we need to integrate
it with the existing Metrics reporting.

** Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/tco-107

** How does this address that need:

* Adds a citation field (integer) to the metrics_algorithms table

* Defines a match_citations method in the Algorithms model, to pass
  search events through the Citation detector and count those where one
  is found (via the detection? method). This also includes expanding the
  event_matches logic to account for this in the unmatched calculations.

* Creates a new search_event fixture, using the citation term fixture,
  to give the algorithm something to count.

* As a side effect, we have to tweak the isbn term fixture, because it
  was unintentionally tripping the citation as well. See the side effect
  section for a longer discussion of this.

* Adds tests to the Algorithm model to account for everything above,
  copying from other calculations.

** Document any side effects to this change:

The fact that this needed to tweak a different term fixture is an
indication that our detectors are starting to overlap. This is a real
phenomenon in production, and is expected as we start to build more
detectors.

I do not have a notion for now about how to address this in the more
constrained test environment, where we can carefully craft fixtures to
do exactly what we intend. It feels like this behavior should be
confirmed via explicit tests that address this overlap, but the tests
we have so far are focused on one-and-only-one detector, so I feel
okay about tweaking things to keep them separate.

We do have a term fixture that trips multiple detectors, confirming
that they get counted in all applicable categories, so we may be
covered already?
@matt-bernhardt matt-bernhardt marked this pull request as ready for review October 17, 2024 21:53
@mitlib mitlib temporarily deployed to tacos-api-pipeline-pr-126 October 17, 2024 21:55 Inactive
@JPrevost JPrevost self-assigned this Oct 18, 2024
journal_exact = process_journals(event, matches)
suggested_resource_exact = process_suggested_resources(event, matches)
lcshs = match_lcsh(event, matches)

matches[:unmatched] += 1 if ids.detections.blank? && lcshs.detections.blank? && journal_exact.count.zero? && suggested_resource_exact.count.zero?
matches[:unmatched] += 1 if ids.detections.blank? && !citation.detection? && lcshs.detections.blank? && journal_exact.count.zero? && suggested_resource_exact.count.zero?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to normalize Citation a bit more after seeing how it behaves differently. Not in this PR, but just eventually. This could take the form of adding the .detection? to all of our Detectors and/or normalizing what .detection returns. No changes are should be made as part of this work, but this difference made me pause and have to understand why it was done this way. It makes sense based on how Citation currently works but feels like it may be confusing in the future as to why it is different.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree that having a more standardized set of methods across our detectors would be helpful. Some differences might be unavoidable (the citation detector doesn't extract specific patterns, while the standard identifier one does), but looking now I think the others (which all treat the phrase holistically) could benefit from a .detection? method.

@matt-bernhardt matt-bernhardt merged commit 53b4e6a into main Oct 18, 2024
3 of 5 checks passed
@matt-bernhardt matt-bernhardt deleted the tco-97-metrics branch October 18, 2024 15:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants